PMM-14661 Fix QAN crashes, make inserts asynchronous by ademidoff · Pull Request #5499 · percona/pmm

ademidoff · 2026-06-15T04:45:46Z

Ticket number: PMM-14661

Summary

Reduce QAN read-path scan cost and stabilize ingestion without changing the schema or result semantics. A larger structural change (a pre-aggregated rollup) seems necessary, but will be reconsidered separately.

Changes

Single-scan filter dimensions (reporter.go) — queryDimension uses sumIf(<metric>, <active-dimension predicate>) in one scan, replacing the filtered + "enumerate-with-0" UNION ALL (two scans → one per dimension), roughly halving the filters-panel scans.
Asynchronous inserts (default-users.xml, low-memory-users.xml) — enable ClickHouse async_insert on the default profile so metric inserts are buffered server-side and flushed in batches, cutting per-insert overhead and write contention on the QAN tables. We rely on the default wait_for_async_insert=1, per ClickHouse's strong recommendation: the INSERT still blocks until the server flushes, so commit errors are surfaced synchronously (keeping the existing rollback/retry path in data_ingestion.go meaningful) and the server can apply backpressure instead of being overrun. wait_for_async_insert=0 would silence those errors and defeat backpressure, so it is deliberately not used.
Ingest buffer left at 100 — the in-process request queue (requestsCap in data_ingestion.go) is intentionally unchanged. It is the backpressure/decoupling mechanism between the gRPC Save handler and the single DB-writer goroutine, not the batching layer, so async_insert does not make it redundant. With wait_for_async_insert=1, slower commits correctly chain into agent-side backpressure; enlarging the queue would only mask that and increase in-memory data loss on restart. Tune via ClickHouse-side settings (e.g. async_insert_busy_timeout_ms) if the qan_api2_data_ingestion_requests_len gauge rides near the cap under load.

Considered and dropped

Parallel sparklines — an earlier revision fetched per-row sparklines concurrently (a bounded errgroup) to cut report latency. Reverted: running several sparkline aggregations at once multiplies peak ClickHouse memory during a single report, which works against the OOM/crash fix this PR targets — most acutely on the low-memory profile. It also forced a larger shared connection pool (read load could starve the ingest writer). Sparklines remain serial (one query per row), and maxOpenConns/maxIdleConns stay at their baseline.

codecov · 2026-06-15T04:47:32Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 43.53%. Comparing base (16ac903) to head (1434e43).
⚠️ Report is 3 commits behind head on v3.

Additional details and impacted files

@@            Coverage Diff             @@
##               v3    #5499      +/-   ##
==========================================
- Coverage   43.54%   43.53%   -0.01%     
==========================================
  Files         413      413              
  Lines       42928    42943      +15     
==========================================
+ Hits        18692    18697       +5     
- Misses      22363    22373      +10     
  Partials     1873     1873

Flag	Coverage Δ
admin	`34.78% <ø> (ø)`
agent	`49.15% <ø> (ø)`
managed	`42.84% <ø> (-0.01%)`	⬇️
vmproxy	`72.22% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Bring over the net changes from PR #5443 (branch PMM-15113-improve-qan-read-queries): parallel sparklines and the single-scan sumIf filter merge in the QAN reporter/profile paths, plus the accompanying reporter test and Makefile/AGENTS updates. The add-then-revert 1-hour metrics rollup churn from that PR's history is excluded; only its final net diff is applied.

JiriCtvrtka · 2026-06-16T10:16:37Z

+// run concurrently. It is kept well below the connection pool size (see maxOpenConns
+// in main.go) so one request cannot monopolize the pool — which is shared with the
+// data ingestion writer — or flood ClickHouse with concurrent scans.
+const MaxParallelQueries = 4


With this is maxOpenConns enough? It is 10 and shared with the ingest writer, report Select(), filter queries, and up to 4 parallel sparklines per report. I believe in this case few concurrent QAN users could exhaust the pool and time out.

Good catch — this is moot now. I've removed the sparkline parallelization (and MaxParallelQueries) in 1434e43.

Running several sparkline aggregations concurrently multiplies peak ClickHouse memory per report, which works against the OOM/crash fix this PR targets — most acutely on the low-memory profile — and it coupled read concurrency to the shared pool (a burst of reports could starve the single ingest writer). Sparklines are serial again and maxOpenConns/maxIdleConns stay at baseline (10/5), so the pool-exhaustion concern no longer applies.

The report-latency win will be revisited separately, likely via a pre-aggregated rollup rather than more concurrency.

…o fetch sparklines sequentially

PMM-14661 Fix QAN craches

909b12d

ademidoff mentioned this pull request Jun 15, 2026

PMM-14661-fix-qan-crashes Percona-Lab/pmm-submodules#4406

Draft

ademidoff marked this pull request as ready for review June 15, 2026 06:12

ademidoff requested a review from a team as a code owner June 15, 2026 06:12

ademidoff requested review from 4nte and JiriCtvrtka and removed request for a team June 15, 2026 06:12

JiriCtvrtka reviewed Jun 16, 2026

View reviewed changes

Comment thread build/ansible/roles/clickhouse/files/default-users.xml

JiriCtvrtka reviewed Jun 16, 2026

View reviewed changes

Comment thread build/ansible/roles/clickhouse/files/low-memory-users.xml

JiriCtvrtka reviewed Jun 16, 2026

View reviewed changes

4nte approved these changes Jun 16, 2026

View reviewed changes

PMM-14661 Add wait_for_async_insert to default profiles

f911a2e

ademidoff changed the title ~~PMM-14661 Fix QAN craches~~ PMM-14661 Fix QAN crashes Jun 16, 2026

PMM-14661 Remove MaxParallelQueries constant and refactor GetReport t…

1434e43

…o fetch sparklines sequentially

ademidoff changed the title ~~PMM-14661 Fix QAN crashes~~ PMM-14661 Fix QAN crashes, make inserts asynchronous Jun 17, 2026

JiriCtvrtka self-requested a review June 17, 2026 07:50

JiriCtvrtka approved these changes Jun 17, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PMM-14661 Fix QAN crashes, make inserts asynchronous#5499

PMM-14661 Fix QAN crashes, make inserts asynchronous#5499
ademidoff wants to merge 4 commits into
mainfrom
PMM-14661-fix-qan-crashes

ademidoff commented Jun 15, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jun 15, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

JiriCtvrtka Jun 16, 2026

Uh oh!

ademidoff Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

ademidoff commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Considered and dropped

Uh oh!

codecov Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Uh oh!

JiriCtvrtka Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

ademidoff Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ademidoff commented Jun 15, 2026 •

edited

Loading

codecov Bot commented Jun 15, 2026 •

edited

Loading